Dirty Data Project

Dataset 1 : Candy Data Project

All i can say is …phoooffffff … what a data set to work on …was a good learning experience though…

Cleaning notes

There were 3 data sets involved in this project, boing-boing-candy-2015.xlsx, boing-boing-candy-2016.xlsx and boing-boing-candy-2017.xlsx.

To clean it I did the following :

  1. First task was to identify the relevant data required as every data set had more than 100 columns each, this was done by looking at the questions to be answered and by identifying the variables to work with in each of the data set.
  2. I devided my task of cleaning the data into two phases. Phase 1 : Where in i called it primary cleaning involved
    1 Reading the files in the relevant forat using the read_excel function
    2 Adding a variable Year which would be needed further for analysys using the mumate function
    3 Cleaning the variable names using the clean_variable function
    4 Pivoting the data at the relevant variables using the pivot_longer function.
    5 Selecting the relevant variables required for analysis using the select function.
    6 Re-naming the data with the relavant names using the names function.
    7 Filtering the age column using the a combination of mutate, as.integer and ifelse statement.
    8 Finally the same exercise was repeteaded for the other two data sets from 2016 and 2017.
  3. Then came Phase 2, which i felt was a mammoth of a task, mainly involved cleaning of country names, minute details looking into the age and restricting it between 1 and 99, checking the formats in which data was pharsed,etc…

Here is my code:

library(tidyverse)
library(here)
library(janitor)

READING THE DATA-SET

candy_data <- read_csv(here("clean_data/candy_all_years_clean.csv"),
                       col_types = cols(
                            gender = col_character(),
                            country = col_character()
                            ))

Analysis questions

1> What is the total number of candy ratings given across the three years. (number of candy ratings, not number of raters. Don’t count missing values)

candy_data %>%
  filter(!is.na(rating)) %>%
   summarise(total_no_of_candy_ratings = n())
## # A tibble: 1 x 1
##   total_no_of_candy_ratings
##                       <int>
## 1                    762355

2> What was the average age of people who are going out trick or treating and the average age of people 3. not going trick or treating?

candy_data %>%
  group_by(trick_or_treat_yourself) %>%
  summarise( average_age = mean(age,na.rm = TRUE))
## # A tibble: 3 x 2
##   trick_or_treat_yourself average_age
##   <chr>                         <dbl>
## 1 No                             39.2
## 2 Yes                            35.1
## 3 <NA>                           42.3

3> For each of joy, despair and meh, which candy bar revived the most of these ratings?

candy_data %>%
  filter(!is.na(rating))%>%
  group_by(rating, candy_name) %>%
  summarise( count = n()) %>%
  filter(count == max(count))
## # A tibble: 3 x 3
## # Groups:   rating [3]
##   rating  candy_name               count
##   <chr>   <chr>                    <int>
## 1 DESPAIR broken_glow_stick         7905
## 2 JOY     any_full_sized_candy_bar  7589
## 3 MEH     lollipops                 1570

4> How many people rated Starburst as despair?

candy_data %>%
  filter(candy_name == "starburst", rating == "DESPAIR") %>%
  summarise (total_starburst_despair = n())
## # A tibble: 1 x 1
##   total_starburst_despair
##                     <int>
## 1                    1990

For the next three questions, count despair as -1, joy as +1 and meh as 0.

5> What was the most popular candy bar by this rating system for each gender in the dataset? # First lets create a seperate column called emotion_numeric using the recode function

candy_data_numeric  <- candy_data %>%
  mutate(rating_numeric = recode(rating, DESPAIR = -1, JOY = 1, MEH = 0))
candy_data_numeric %>%
  group_by(gender, candy_name)%>%
  summarise(sum = sum(rating_numeric, na.rm = TRUE)) %>%
  filter(sum == max(sum))
## # A tibble: 5 x 3
## # Groups:   gender [5]
##   gender             candy_name                 sum
##   <chr>              <chr>                    <dbl>
## 1 Female             any_full_sized_candy_bar   875
## 2 I'd rather not say any_full_sized_candy_bar    68
## 3 Male               any_full_sized_candy_bar  1584
## 4 Other              any_full_sized_candy_bar    37
## 5 <NA>               any_full_sized_candy_bar  4618

6> What was the most popular candy bar in each year?

candy_data_numeric %>%
  group_by(year, candy_name)%>%
  summarise(sum = sum(rating_numeric, na.rm = TRUE)) %>%
  filter( sum == max(sum))
## # A tibble: 3 x 3
## # Groups:   year [3]
##    year candy_name                 sum
##   <dbl> <chr>                    <dbl>
## 1  2015 any_full_sized_candy_bar  4603
## 2  2016 any_full_sized_candy_bar  1037
## 3  2017 any_full_sized_candy_bar  1542

7> What was the most popular candy bar by this rating for people in US, Canada, UK and all other countries?

candy_data_numeric%>%
  mutate(country = if_else(country %in% c("united states", "canada", "united kingdom"), country, "other")) %>%
  group_by(country, candy_name)%>%
  summarise(sum = sum(rating_numeric, na.rm = TRUE)) %>%
  filter( sum == max(sum))
## # A tibble: 5 x 3
## # Groups:   country [4]
##   country        candy_name                 sum
##   <chr>          <chr>                    <dbl>
## 1 canada         any_full_sized_candy_bar   254
## 2 other          any_full_sized_candy_bar  5761
## 3 united kingdom lindt_truffle               15
## 4 united kingdom rolos                       15
## 5 united states  any_full_sized_candy_bar  1153